7  Obtaining Sport Data

“I am the greatest, I said that even before I knew I was.” - Muhammad Ali

Muhammad Ali reacts after his first-round knockout of Sonny Liston during the 1965 World Heavyweight Title fight at St. Dominic’s Arena in Lewiston, Maine, on May 25, 1965. The image is considered one of the greatest sports photos in history.

In sports analytics, the foundation of any meaningful analysis is the data itself. Whether we are predicting player performance, analyzing team strategies, or modeling game outcomes, obtaining high-quality sports data is the critical first step. Fortunately, the advent of open-source tools and the growing interest in sports analytics have made a wealth of data accessible to students and researchers alike. In this section, we will explore popular R packages that provide sports data, discuss external websites where additional datasets can be found, and outline key characteristics to consider when selecting a sports dataset for analysis.

7.2 External Websites for Sports Data

Beyond R packages, numerous websites provide sports data that can be downloaded and imported into R for analysis. These sources often offer raw data in formats like CSV, which can be read into R using read_csv() from the readr package (part of the tidyverse). Below are some notable options:

  • Basketball-Reference.com: A comprehensive resource for basketball statistics, including the NBA, WNBA, and international leagues; college (NCAA) statistics are hosted on its sister Sports-Reference site. Data such as Paul George’s free throw attempts (used in Section 2) can be exported as CSV files. For example, navigate to a player’s page, open the game log for a season, and export the table as a CSV file.

  • Baseball-Reference.com: The baseball counterpart to Basketball-Reference, offering detailed MLB statistics. It covers much of the same ground as the Lahman package, which is organized by season, but adds finer granularity such as game logs and play-by-play records.

  • ESPN.com: A broad source for real-time and historical data across sports like basketball, football, and baseball. ESPN’s basketball data can be accessed programmatically through packages like hoopR, and data for other sports can be scraped or downloaded manually from the site’s statistics pages.

  • Kaggle.com: A platform hosting user-contributed datasets, many of which are sports-related. Search for “sports analytics” to find datasets like the PGA golf data from Section 4 or cricket match records from Section 5. These datasets are often in CSV format and accompanied by descriptions of variables.
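As a minimal sketch of the download-and-import workflow described above, the code below writes a small stand-in game log to a temporary file and reads it back with read_csv(). The file contents, the column names (game_date, opp, fta, ft), and the values are invented for illustration; a real export from any of these sites will have its own columns and may need different column types.

```r
library(readr)

# Hypothetical stand-in for a game log exported from a statistics site.
# Column names and values are made up for illustration.
csv_lines <- c(
  "game_date,opp,fta,ft",
  "2024-01-03,BOS,6,5",
  "2024-01-05,NYK,4,4",
  "2024-01-08,MIA,8,6"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# Declaring column types up front makes import problems visible early.
game_log <- read_csv(
  path,
  col_types = cols(
    game_date = col_date(format = "%Y-%m-%d"),
    opp       = col_character(),
    fta       = col_integer(),
    ft        = col_integer()
  )
)

# Quick checks after import: dimensions and column types.
dim(game_log)
str(game_log)
```

With a file downloaded by hand, path would instead point to wherever the CSV was saved.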

When working with external data, ensure it aligns with the tidy data principles we’ve used: each row should represent an observation (e.g., a game or player-season), and each column a variable.
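Downloaded tables are often wide, with one column per month or season, and need reshaping before they are tidy. The sketch below shows one way this might be done with pivot_longer() from tidyr. The table, player labels, and free throw percentages are hypothetical and serve only to illustrate the reshaping step.

```r
library(tidyr)

# Hypothetical wide table: one row per player, one column per month.
# Values are invented free throw percentages.
wide <- data.frame(
  player = c("Player A", "Player B"),
  Oct = c(0.85, 0.78),
  Nov = c(0.88, 0.80),
  Dec = c(0.82, 0.75)
)

# Reshape so each row is one observation (a player-month)
# and each column is one variable.
tidy <- pivot_longer(
  wide,
  cols = c(Oct, Nov, Dec),
  names_to = "month",
  values_to = "ft_pct"
)

tidy
```

After reshaping, month and ft_pct are ordinary columns that can be used directly in plots and models.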

7.3 What to Look for in a Sports Dataset

Selecting an appropriate dataset is crucial for conducting meaningful sports analytics. As you explore available data, consider the following characteristics to ensure it meets your analytical needs:

  1. Response Variable: The response variable is the outcome you aim to analyze or predict. For example, in Sections 1 and 4, we used continuous responses like batting average (BA) and FedEx Cup points (Points), while in Section 5, we modeled a binary response (Result, win/loss). Identify a clear response that aligns with your research question—whether it’s a player’s performance metric, a team’s win probability, or a game score.

  2. Predictor Variables (Features): These are the explanatory variables that influence the response. Look for datasets with a mix of numeric predictors (e.g., minutes, avgDriveDist) and categorical predictors (e.g., teamID, court). In Section 1, we used H (hits) and AB (at-bats) to compute batting averages, while in Section 6, minutes was a fixed-effect predictor. Ensure the dataset includes features relevant to your hypothesis, such as player stats, game conditions, or team attributes.

  3. Number of Observations: The sample size affects the reliability of your analysis. A dataset with too few observations (e.g., fewer than about 30, a common rule of thumb) may limit statistical power, while one with thousands (like nba_data from hoopR) supports more precise estimates and more complex models. The PGA2022.csv dataset in Section 4 had 1,387 cases, sufficient for exploring golfer performance across tournaments. Aim for a balance: enough data to detect patterns but manageable for initial exploration.

  4. Data Quality: Check for missing values, inconsistencies, or errors. In Section 4, we used step_naomit() to remove rows with missing data in the PGA dataset. Assess the dataset’s completeness—variables like points or Run.Scored should have minimal gaps. Also, verify that the data is well-documented, with clear definitions for each column (e.g., driveSG as strokes gained off the tee).

  5. Hierarchical Structure: For mixed-effects models (Section 6), seek datasets with clustering, such as players within teams or games within seasons. The nba_data from hoopR includes athlete_id and team_name, enabling random effects for players or teams. Similarly, the cricket_asia_cup.csv dataset in Section 5 groups matches by Year and Team.

  6. Temporal or Spatial Context: Sports data often includes a time dimension (e.g., yearID in Lahman) or location. These variables allow you to explore trends over time or differences across settings, as we did with Paul George’s free throw percentages by month in Section 2.
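The checklist above can be turned into a quick audit that you run before committing to a dataset. The sketch below uses base R on a small, invented player-season table. The column names echo those in Lahman (playerID, teamID, yearID, H, AB), but the rows are hypothetical and are not taken from any real source.

```r
# Hypothetical player-season data; all values are invented.
dat <- data.frame(
  playerID = c("p1", "p2", "p3", "p4", "p5", "p6"),
  teamID   = c("AAA", "AAA", "BBB", "BBB", "BBB", "CCC"),
  yearID   = c(2021, 2021, 2021, 2022, 2022, 2022),
  H        = c(150, 120, NA, 98, 175, 60),
  AB       = c(500, 480, 410, 390, 560, 250)
)

# 1-2. Response and predictors: derive the response from recorded columns.
dat$BA <- dat$H / dat$AB

# 3. Number of observations.
n_obs <- nrow(dat)

# 4. Data quality: count missing values in each column.
n_missing <- colSums(is.na(dat))

# 5. Hierarchical structure: observations per team (cluster sizes).
per_team <- table(dat$teamID)

# 6. Temporal context: range of seasons covered.
season_range <- range(dat$yearID)

n_obs
n_missing
per_team
season_range
```

Here the missing H value carries through to BA, which is the kind of gap worth catching before modeling. Very small clusters, such as a team with a single row, are a signal that random effects for that grouping may be poorly estimated.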

Obtaining sports data is both an art and a science. Packages like hoopR and Lahman provide ready-to-use datasets, while websites like Basketball-Reference and Kaggle offer flexibility for custom analyses. As you select a dataset, prioritize a clear response variable, relevant predictors, and sufficient observations, keeping an eye on data quality and structure.